Test of geocoding
Problem Statement:
Between 2015 and our current dataset, 15676 geographies are missing. We attempt to fill in these geographies using two different methods.
Technique 1:
If x and y values are present for both the preceding and the following year, and those values are within .0001 of each other, we use the average of the two values. (imputed)
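The imputation rule above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `impute_xy` is hypothetical, and it assumes the .0001 tolerance is applied to x and y separately.

```python
from typing import Optional, Tuple

Coord = Tuple[float, float]  # (x, y)


def impute_xy(prev_year: Optional[Coord],
              next_year: Optional[Coord],
              tol: float = 0.0001) -> Optional[Coord]:
    """Average the neighboring years' coordinates when they agree.

    Returns None when either year is missing or the two years differ
    by more than `tol` in x or in y (assumption: per-axis check).
    """
    if prev_year is None or next_year is None:
        return None
    dx = abs(prev_year[0] - next_year[0])
    dy = abs(prev_year[1] - next_year[1])
    if dx > tol or dy > tol:
        return None
    return ((prev_year[0] + next_year[0]) / 2,
            (prev_year[1] + next_year[1]) / 2)
```

A PIN with no usable neighbor years returns `None` and is left for the geocoding technique.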
Technique 2:
We geocode the addresses using OpenStreetMap. (geocoding)
Results:
The number of geographies filled by imputing is 4327 and the number of geographies filled by geocoding is 12198.
Imputed PINs:
Geocoded PINs:
Map of all PINs with results from both techniques:
To compare the two techniques, we join their results back together and create a distance (ft) column. This is the distance between the imputed and geocoded locations for PINs that were filled by both techniques. The result shows that the two techniques give very similar locations, although there are some differences.
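One way to compute such a distance column is the haversine formula, sketched below. This is an assumption-laden illustration: it treats x and y as longitude and latitude in degrees, and haversine is a stand-in for whatever distance function the analysis actually used. The name `haversine_ft` is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius in meters
METERS_PER_FOOT = 0.3048


def haversine_ft(lon1: float, lat1: float,
                 lon2: float, lat2: float) -> float:
    """Great-circle distance in feet between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c / METERS_PER_FOOT
```

Applied row by row to the joined table, with the imputed point as the first pair and the geocoded point as the second, this yields one distance per PIN.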
PINs with Long Distances (> 1000 ft):
The main (fixable) issue that I saw here was incorrect street directions (for example, 359 W Ohio is likely 359 E Ohio).
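A possible way to act on this is to generate a candidate address with the opposite street direction and geocode that as well. The sketch below only builds the candidate string. The helper `flip_direction` is hypothetical and assumes addresses start with a house number followed by a single-letter direction.

```python
import re
from typing import Optional

OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

# House number, single-letter direction, then the rest of the address.
_ADDRESS = re.compile(r"^(\d+)\s+([NSEW])\s+(.*)$", re.IGNORECASE)


def flip_direction(address: str) -> Optional[str]:
    """Return the address with its street direction reversed.

    Returns None when the address has no leading direction to flip.
    """
    match = _ADDRESS.match(address.strip())
    if match is None:
        return None
    number, direction, rest = match.groups()
    return f"{number} {OPPOSITE[direction.upper()]} {rest}"
```

For instance, `flip_direction("359 W Ohio")` returns `"359 E Ohio"`. Whether the flipped address is the right one still has to be checked against the imputed location.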
Histogram of Distance (ft) Between Techniques
The vast majority of PINs are within 300 feet of each other. An example of this would be Address: 1346 W CULLERTON ST 3 CHICAGO IL 60608, CHICAGO, IL. Its two locations are 185 feet apart, or about three houses. While not ideal, it definitely seems to be an improvement over no information.
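The thresholds used in this comparison (300 ft and 1000 ft) can be summarized with a small helper, sketched below. The function name `summarize_distances` is hypothetical, and any values passed to it in examples are made up for illustration, not taken from the data.

```python
from typing import Dict, Sequence


def summarize_distances(distances_ft: Sequence[float],
                        near_ft: float = 300.0,
                        far_ft: float = 1000.0) -> Dict[str, float]:
    """Count PINs by how far apart the two techniques place them."""
    n = len(distances_ft)
    if n == 0:
        raise ValueError("no distances to summarize")
    n_near = sum(1 for d in distances_ft if d <= near_ft)
    n_far = sum(1 for d in distances_ft if d > far_ft)
    return {
        "n": n,
        "share_within_near": n_near / n,  # fraction at or under near_ft
        "n_far": n_far,                   # count strictly over far_ft
    }
```

Running this on the distance column gives the share of PINs within 300 ft and the number of PINs that belong in the long-distance table.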
What this does not answer is whether PINs filled by only one technique are harder to code (weirder addresses). There is no intuitive reason why this would be the case, but it is something we should think about.
Conclusion:
In general, the two techniques produce similar results. I'd have to go through them in more depth, but at the moment the geocoding looks a bit better, specifically because it fills more observations. On the other hand, the imputed values are more likely to be correct, since we know we have some issues with our address data.